A Unified Model for Spatio-Temporal Prediction
Queries with Arbitrary Modifiable Areal Units
Liyue Chen1,4, Jiangyi Fang1,4, Tengfei Liu2, Shaosheng Cao3∗, Leye Wang1,4∗
1 Key Lab of High Confidence Software Technologies (Peking University), Ministry of Education, China
2 China University of Geosciences, Wuhan, China
3 DiDi Chuxing, Hangzhou, China
4 School of Computer Science, Peking University, Beijing
chenliyue2019@gmail.com,shelsoncao@didiglobal.com,leyewang@pku.edu.cn
Abstract—Spatio-Temporal (ST) prediction is crucial for mak-
ing informed decisions in urban location-based applications like
ride-sharing. However, existing ST models often require region
partition as a prerequisite, resulting in two main pitfalls. Firstly,
location-based services necessitate ad-hoc regions for various
purposes, requiring multiple ST models with varying scales and
zones, which can be costly to support. Secondly, different ST
models may produce conflicting outputs, resulting in confusing
predictions. In this paper, we propose One4All-ST, a framework
that can conduct ST prediction for arbitrary modifiable areal
units using only one model. To reduce the cost of getting multi-
scale predictions, we design an ST network with hierarchical spa-
tial modeling and scale normalization modules to efficiently and
equally learn multi-scale representations. To address prediction
inconsistencies across scales, we propose a dynamic programming
scheme to solve the formulated optimal combination problem,
minimizing predicted error through theoretical analysis. Besides,
we suggest using an extended quad-tree to index the optimal
combinations for quick response to arbitrary modifiable areal
units in practical online scenarios. Extensive experiments on
two real-world datasets verify the efficiency and effectiveness
of One4All-ST in ST prediction for arbitrary modifiable areal
units. The source code and data of this work are available at
https://github.com/uctb/One4All-ST.
Index Terms—unified model, spatio-temporal prediction, mod-
ifiable areal units
I. INTRODUCTION
With rapid urbanization and advancements in sensing tech-
nology, Spatio-Temporal (ST) data with location-based capa-
bilities and timestamps are being widely collected from ubiq-
uitous infrastructure and smart devices. Designing ST models
for various tasks (e.g., mobility prediction and anomaly
detection) is crucial to empower enterprises and governments
to monitor urban dynamics [1], [2], manage resources [3]–[5],
conserve energy [6], and enhance public services [7]–[9]. This
serves as the foundation for making informed decisions.
Although current ST models perform well with a specific
region partition [10]–[14], there are still gaps in providing
ubiquitous location-based services. Firstly, as shown in Fig. 1,
real-world applications rely on ST prediction with various
region specifications as a decision-making basis. For instance,
online ride-hailing platforms like Uber require demand predic-
tion tasks and taxi flow control tasks that may involve areas of
*Corresponding author
Fig. 1: Research motivation. Left: Location-based services
require ad-hoc regions for various purposes, necessitating
numerous ST models to support the service. Right: The
prediction inconsistency raised by different ST models.
1 km² [15], [16] and 0.25 km² [3], [17], respectively. Moreover,
prediction tasks may need to change their analysis areas as
urban traffic patterns differ throughout the day [18], leading to
corresponding changes in the community transportation
structure of interest [19]. When region specifications change or
analyses at different scales are needed, relying on many ad-hoc ST models
can be costly. Secondly, creating a model for each region
partition can produce confusing inconsistencies. This issue is
well known in geoscience as the modifiable
areal unit problem (MAUP) [20], [21]: different region
specifications may lead to divergent analysis outcomes [22]–[24].
For example, as shown in the right chart of Fig. 1, coarser
models may produce outputs that conflict with those of finer
ones, causing confusion about which result to use.
Hence, a fundamental research question arises: can we
build a unified ST prediction model for arbitrary Mod-
ifiable Areal Units (MAU)? Such a model can greatly reduce
the development and deployment costs and seamlessly adapt to
changes in regions of interest over time. However, predicting
for arbitrary MAU with just one model is difficult due to
variations in scale and zones within the region of interest
across different prediction tasks.
One intuitive approach is to create a fine-grained ST model
capable of producing coarser results through aggregation. This
approach has the advantage of low computational costs since
only one model is needed, and pioneering works have made
great efforts to develop fine-grained ST models [13], [25].
However, applying fine-grained ST models to coarser scales
may lead to inferior performance. As demonstrated later in
arXiv:2403.07022v1  [cs.LG]  10 Mar 2024

our experiments, aggregating the results of the fine-scale ST-
ResNet [26] model for predictions increases the RMSE (Root
Mean Square Error) by 15.2% compared with directly using
predictions of the coarse-scale ST-ResNet model.
A more effective yet costly approach is training many ad-
hoc models for multi-scale outputs and selecting the appropri-
ate scale for arbitrary region queries1. Recently, a pioneering
work [27] simultaneously performs fine- and coarse-grained
predictions using a single model through multi-task learning.
This approach reduces the number of parameters required
for obtaining two-scale predictions, compared to training two
separate models. Despite its contribution towards efficient
multi-scale ST prediction, challenges remain in providing
ubiquitous location-based services with MAU.
Challenge 1. How to equally learn multi-scale represen-
tations in a lightweight manner? Handling modifiable areal
units of varying scales requires multi-scale representations
and predictions (more than two scales), raising two issues.
Firstly, previous models use separate modules to learn multi-
scale representation [27], which can be costly as the number
of scales increases. Hence, an efficient structure for multi-
scale learning is urgently needed. Secondly, existing multi-
task learning methods balance the loss magnitude between
multi-scale tasks by manually setting weights [27], [28]. While
this method is feasible when the scale structure is relatively
simple (e.g., only two scales), it becomes cumbersome and
impractical when dealing with a large number of scales.
Challenge 2. How to effectively represent modifiable areal
units using the pre-decided multi-scale regions? The spatial
heterogeneity of ST data poses challenges for prediction [29],
which can vary across scales and zones [23], [24], leading
to inconsistencies between predictions for different region
specifications. There may exist many feasible combinations
to represent modifiable areal units. For example, Fig. 2 shows
three combinations to represent the same region of interest
based on pre-decided grid scales. How do we determine the
optimal combination for getting the most accurate predictions?
Fig. 2: Three combinations with different predictability for
representing the same modifiable areal unit.
To address these challenges, we propose a framework called
One4All-ST. Our main contributions include:
• To the best of our knowledge, this is one of the first works to study
the ST prediction problem for arbitrary modifiable areal
units. It is also a useful attempt to alleviate the high
cost and prediction inconsistency caused by developing
multiple models for location-based services.
1 Without incurring ambiguity, region queries refer to the regions of interest
queried by location-based services.
• One4All-ST includes three main components. Firstly, we
design a lightweight network with hierarchical spatial mod-
eling and scale normalization modules to efficiently and
equally learn multi-scale representations. Secondly, we for-
mulate the optimal combination problem to select appropri-
ate scale outputs for modifiable areal units and solve it with
theoretical analysis. Thirdly, we suggest using an extended
quad-tree to index the optimal combinations, enabling rapid
prediction responses in practical online scenarios.
• We extensively experiment on two real-world datasets to
verify the efficiency and effectiveness of One4All-ST in
accurately predicting arbitrary modifiable areal units.
II. PRELIMINARIES AND PROBLEM STATEMENT
Definition 1 (Hierarchical grids) We partition an area of
interest (e.g., a city) evenly into an atomic raster with a total of
$N = H \times W$ grids. The atomic raster is in Layer 1 and has the
highest resolution (e.g., 150m × 150m) by setting either the
largest $H$ or $W$. Other layers obtain lower-resolution grids
by combining adjacent high-resolution grids with a window
containing $K \times K$ grids. The stride of the sliding window is also
set to $K$, ensuring that each higher-resolution grid belongs
to only one lower-resolution grid in the hierarchical structure.
Specifically, the window size in Layer $l$ is $K_l$ and Layer $l$
has $H_l \times W_l$ grids, satisfying $H = H_l \cdot \xi_l$ and $W = W_l \cdot \xi_l$, where
$\xi_l = \prod_{i=1}^{l-1} K_i$. For convenience, we refer to Layer $l$, whose
grids are $\xi_l$ times larger per side than atomic grids, as Scale $\xi_l$ (abbreviated as $S_{\xi_l}$).
Here, $S_{\xi_l} = \mathbf{1}^{H_l \times W_l}$ is a matrix where all elements are one.
As shown in Fig. 3(a), Layer 3 ($S_4$) and Layer 2 ($S_2$) are
merged from their respective previous layers using a 2×2 window.
Fig. 3: (a): Example of hierarchical grids. (b): A rasterized
region and its assignment matrix. (c): Example of the mapping
function. (d): Example of a grid combination.
Definition 2 (Hierarchical structure) Given S1, which is the
atomic raster, and Sn, the hierarchical structure P is a set that
records all scales contained. For instance, if the size of the
window in every layer is 2×2, the hierarchical structure from
S1 to S16 would be P = {1, 2, 4, 8, 16}.
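Definitions 1 and 2 can be illustrated with a short Python sketch. It derives the hierarchical structure P from per-layer window sizes and aggregates an atomic raster to a coarser scale. The function names (`scales_from_windows`, `aggregate`) are hypothetical, and summing the atomic values inside each block is an assumption that fits count-like measurements such as crowd flow.

```python
from math import prod

def scales_from_windows(window_sizes):
    """Return the hierarchical structure P for the given per-layer window sizes.

    window_sizes[i] is the merging window K used to build layer i + 2 from
    layer i + 1, so the scale of layer l is the product of the first l - 1 windows.
    """
    return [prod(window_sizes[:l]) for l in range(len(window_sizes) + 1)]

def aggregate(atomic, scale):
    """Sum non-overlapping scale x scale blocks of an atomic raster (list of rows)."""
    height, width = len(atomic), len(atomic[0])
    if height % scale or width % scale:
        raise ValueError("raster size must be divisible by the scale")
    return [
        [
            sum(atomic[i][j]
                for i in range(r * scale, (r + 1) * scale)
                for j in range(c * scale, (c + 1) * scale))
            for c in range(width // scale)
        ]
        for r in range(height // scale)
    ]

P = scales_from_windows([2, 2, 2, 2])   # [1, 2, 4, 8, 16]
atomic = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
coarse = aggregate(atomic, 2)           # [[14, 22], [46, 54]]
```

With a 2×2 window in every layer, the sketch reproduces the structure P = {1, 2, 4, 8, 16} given in Definition 2.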
Definition 3 (Citywide crowd flow [13], [26]) The crowd
flow at time $t$ on Layer $l$ is a 3D tensor $X^l_t \in \mathbb{R}^{H_l \times W_l \times C}$,

Fig. 4: The workflow of One4All-ST system.
where $C$ is the total number of flow measurements (e.g., inflow
or outflow). Each entry $(h, w, c)$ indicates the value of the $c$-th
measurement in grid $(h, w)$. Specifically, $X^1_t \in \mathbb{R}^{H \times W \times C}$
denotes the citywide crowd flow of atomic grids at time $t$.
Definition 4 (Rasterized region) The region is a geographic
polygon represented by a path that contains a list of geo-coordinates
$\langle (lat_1, lng_1), ..., (lat_n, lng_n) \rangle$ defining its boundaries.
Region $R$ can be rewritten as $R = A^R \odot S_1$ by
rasterizing and aligning it with atomic grids ($S_1 = \mathbf{1}^{H \times W}$).
The assignment matrix $A^R \in \{0, 1\}^{H \times W}$ indicates whether an
atomic grid belongs to region $R$ by setting its corresponding
value in the matrix to 1 ($A^R_{i,j} = 1$). For example, Fig. 3(b)
displays a rasterized region and its assignment matrix.
ST Prediction Problem for Modifiable Areal Units Given
a set of arbitrary rasterized regions $\mathcal{R} = \{R_1, R_2, ...\}$
derived from various tasks and a series of historical citywide
crowd flow from time slot 1 to time slot $t-1$ on atomic grids
$\{X^1_1, X^1_2, ..., X^1_{t-1}\}$, we want to predict the crowd flow for
each rasterized region $R_i$ in the next time slot $t$ to minimize:
$$L(\hat{X}^{R_i}_t, X^{R_i}_t), \quad \forall R_i \in \mathcal{R} \tag{1}$$
where $\hat{X}^{R_i}_t$ and $X^{R_i}_t$ are the predicted and ground truth crowd
flow of $R_i$ in the next time slot $t$, and $L$ is the loss function (e.g.,
mean square error). To solve this problem, we break it
down into the following two sub-problems.
Multi-scale ST Prediction Problem Given a series of historical
citywide crowd flow on atomic grids $\{X^1_1, X^1_2, ..., X^1_{t-1}\}$ and
the hierarchical structure $P$, we want to predict the crowd flow
for each scale $s$ in the next time slot $t$ to minimize:
$$L(\hat{X}^s_t, X^s_t), \quad \forall s \in P \tag{2}$$
where $\hat{X}^s_t$ and $X^s_t$ are the predicted and ground truth citywide
crowd flow at Scale $s$ in the next time slot $t$.
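Optimal Combination Problem for Modifiable Areal Units
Given an arbitrary rasterized region $R$ and the hierarchical structure
$P$, our target is to find the optimal combination $\Lambda^*(R)$
by minimizing the prediction error:
$$\Lambda^*(R) = \arg\min_{\Lambda} \sum_t L\Big(\sum_{s \in P} \|\lambda^s \odot f_{\Theta^*}(X^s_{t-T:t-1})\|, X^R_t\Big) \tag{3}$$
$$\text{s.t.} \quad \Theta^* = \arg\min_{\Theta} \sum_{s \in P} \sum_t L\big(f_{\Theta}(X^s_{t-T:t-1}), X^s_t\big) \tag{4}$$
$$\text{s.t.} \quad \sum_{s \in P} A^s = A^R \tag{5}$$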
where $\Theta$ denotes the network parameters and $A^R$ is the assignment
matrix of region $R$. The combination $\Lambda = \{\lambda^s \mid s \in P\}$ includes
assignment matrices at all scales within the hierarchical
structure $P$. Here, $\lambda^s$ represents the assignment matrix at Scale
$s$. $\lambda^s$ can be converted into atomic grids represented as
$A^s \in \{-1, 0, 1\}^{H \times W}$ using the mapping function $A^s_{i,j} = \lambda^s_{\lfloor i/s \rfloor, \lfloor j/s \rfloor}$.
Fig. 3(c) provides an illustrative example of this mapping
process. In $\lambda^s$, a value of ‘1’ or ‘-1’ indicates that the
grid is included by union or subtraction, respectively.
Fig. 3(d) illustrates an example combination involving both
union and subtraction operations. The sum of combinations
across all scales must equal the rasterized region (Eq. 5).
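The mapping function and the constraint in Eq. 5 can be sketched in a few lines of Python. The helper names (`to_atomic`, `satisfies_eq5`) and the toy region are hypothetical, and indices are zero-based here, so the floor division matches the mapping only under that convention.

```python
def to_atomic(lam, scale):
    """Expand a scale-s assignment matrix to atomic grids: A[i][j] = lam[i // s][j // s]."""
    return [
        [lam[i // scale][j // scale] for j in range(len(lam[0]) * scale)]
        for i in range(len(lam) * scale)
    ]

def satisfies_eq5(combination, region):
    """Check that the atomic expansions of all scales sum to the region's assignment matrix."""
    height, width = len(region), len(region[0])
    total = [[0] * width for _ in range(height)]
    for scale, lam in combination.items():
        expanded = to_atomic(lam, scale)
        for i in range(height):
            for j in range(width):
                total[i][j] += expanded[i][j]
    return total == region

# Region: a 2 x 2 block of atomic grids without its bottom-right atomic grid.
region = [[1, 1],
          [1, 0]]
union_only = {1: [[1, 1], [1, 0]], 2: [[0]]}          # three atomic grids
with_subtraction = {1: [[0, 0], [0, -1]], 2: [[1]]}   # coarse grid minus one atomic grid
```

Both `union_only` and `with_subtraction` satisfy Eq. 5 for this region, which is why the problem has several feasible combinations to choose from.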
III. SYSTEM WORKFLOW
In this section, we elaborate on the workflow of the proposed
One4All-ST system. As illustrated in Fig. 4, our system
comprises two stages: an offline phase and an online phase.
In the offline phase, our system initiates by training a
multi-scale spatio-temporal network using ST data stored
in Hive [30]. Our analysis shows that for arbitrary MAU,
the optimal combination is achieved through aggregating the
optimal combinations of decomposed hierarchical grids (The-
orem 4.1). Therefore, with the trained models to assess the
quality of combinations, our system searches for the optimal
combination for all grids within the pre-decided hierarchical
structure. Then, our system constructs a quad-tree to index the
optimal combinations of all grids and transmits the index to
HBase [31], ensuring swift responses for online predictions.
In the online phase, the deployed ST model continuously
synchronizes multi-scale predictions with HBase at preset
intervals. The region decomposition server receives region
queries from the location-based services and will decompose
them into grids with various scales. Then, the server retrieves
the optimal combination for every grid based on the index and
obtains the final prediction by aggregating all grids. On our
experiment platform (Sec. V-A5), our system’s response time
for each region query is within 20 milliseconds (Sec. V-B5),
which is sufficient for online services.
IV. METHODOLOGY
A. Framework Overview
The proposed framework, One4All-ST, has three components:
multi-scale joint learning, optimal combination search
and index, and modifiable areal units prediction (Fig. 5).

Fig. 5: Overall framework of One4All-ST.
In the multi-scale joint learning component, we propose a
hierarchical ST network for multi-scale predictions, integrating
a temporal modeling module to capture temporal dependen-
cies and a stacked hierarchical structure for efficient spatial
representation learning. Additionally, the cross-scale modeling
module enhances ST representations by leveraging information
from other scales, all of which are then fed into the multi-task
learning module for multi-scale learning.
In the optimal combination search and index component, we
first show that the optimal combination of modifiable areal
units can be achieved by aggregating optimal combinations
of decomposed hierarchical grids through union operations.
We employ a dynamic-programming search approach to find
the optimal combinations for each hierarchical grid. Then,
we also explore subtraction operations to find other feasible
and improved combinations. After completing these searches,
we construct a quad-tree to index the optimal combinations,
accelerating online prediction for modifiable areal units.
The modifiable areal units prediction component is de-
signed for online use. It first decomposes region queries from
location-based services into hierarchical grids. The final pre-
diction is derived by aggregating these grids utilizing optimal
combinations indexed from the pre-established quad-tree.
B. Multi-scale Joint Learning
The multi-scale joint learning component aims to efficiently
make multi-scale predictions. Our proposed hierarchical multi-
scale ST network (Fig. 6) comprises four modules: Temporal
Modeling, Hierarchical Spatial Modeling, Cross-scale Model-
ing, and Multi-task Learning.
Compared with previous multi-scale ST networks, our im-
provement lies in the following two aspects. Firstly, our hier-
archical spatial modeling module learns the spatial represen-
tations hierarchically by stacking the spatial modeling block
layer by layer instead of using a totally different spatial mod-
eling block for each scale as in previous research [27], [32].
This approach is more efficient for deep hierarchical structures
(as later shown in our experiment, our model achieves better
accuracy with six scales and half the parameters compared
to previous work that only used two scales [27]) since it
extracts the spatial representation of coarser grids from finer
grids in previous layers rather than extracting them from
scratch. Secondly, our multi-task learning module incorporates
a scale normalization mechanism to balance the learning
tasks of different scales and ensure balanced consideration for
every scale. This approach is more reasonable than manually
assigning task weights, as done in previous studies [27], [28].
1) Temporal Modeling: To capture different temporal de-
pendencies, previous research selects several slots (closeness,
period, and trend) along the time [26] and this paradigm is
widely proven efficient in related research [13], [14]. Follow-
ing this, we select recent, near, and distant atomic rasters to
predict the citywide crowd flow at t:
$$X^C_t = [X^1_{t-l_c}, X^1_{t-(l_c-1)}, ..., X^1_{t-1}]$$
$$X^P_t = [X^1_{t-l_d \cdot d}, X^1_{t-(l_d-1) \cdot d}, ..., X^1_{t-d}]$$
$$X^T_t = [X^1_{t-l_w \cdot w}, X^1_{t-(l_w-1) \cdot w}, ..., X^1_{t-w}] \tag{6}$$
where $d$ and $w$ are the daily and weekly intervals respectively
(e.g., in a one-hour prediction task, $d$ and $w$ are 24 and
168), and $l_c$, $l_d$, and $l_w$ are the numbers of slots selected for
each sequence. We use three non-shared convolutional layers to convert
them to temporal representations, each with $D$ channels, i.e.,
they are all in $\mathbb{R}^{H \times W \times D}$. Next, we concatenate these three
temporal features and get the fused representations at Scale 1:
$$h^1_t = \mathrm{Concat}(\mathrm{Conv}(X^C_t); \mathrm{Conv}(X^P_t); \mathrm{Conv}(X^T_t)) \tag{7}$$
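As a small illustration of the slot selection in Eq. 6, the following Python sketch lists the time-slot indices that feed the closeness, period, and trend inputs. The function name and the example values are hypothetical; hourly slots are assumed, so one day spans 24 slots and one week spans 168.

```python
def temporal_indices(t, l_c, l_d, l_w, d, w):
    """Return the time-slot indices used for closeness, period, and trend at target slot t."""
    closeness = [t - k for k in range(l_c, 0, -1)]
    period = [t - k * d for k in range(l_d, 0, -1)]
    trend = [t - k * w for k in range(l_w, 0, -1)]
    return closeness, period, trend

closeness, period, trend = temporal_indices(t=400, l_c=3, l_d=2, l_w=2, d=24, w=168)
# closeness == [397, 398, 399], period == [352, 376], trend == [64, 232]
```

Each list is ordered from the oldest to the most recent slot, matching the order of the entries in Eq. 6.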
2) Hierarchical Spatial Modeling: The spatial modeling
design for all scales includes a scale merging layer and a
spatial modeling block. The scale merging layers aggregate
adjacent grids, reducing the width and height of feature maps.
This process maps spatial representations from finer-grained
grids to coarse-grained ones. The subsequent spatial modeling
block captures spatial dependencies and learns the representation
for a specific scale. Suppose that the hierarchical
structure $P = \{P_1, P_2, ..., P_n\}$ has $n$ scales; the spatio-temporal
representations at time $t$ are computed by:
$$h^{P_i}_t = \mathrm{SM}(\mathrm{Merge}(h^{P_{i-1}}_t)), \quad 2 \le i \le n \tag{8}$$
where $\mathrm{SM}(\cdot)$ and $\mathrm{Merge}(\cdot)$ are the spatial modeling block and
scale merging layer respectively. $P_i$ is a natural number
representing the scale; for instance, $P_1$ always equals 1,
signifying the atomic raster (Scale 1) with $H \times W$ grids. From
Scale 1 (i.e., $P_1$) to Scale $P_n$, by adding more scale merging
layers and spatial modeling blocks, the learned multi-scale
spatial representations are denoted as $\{h^{P_1}_t, h^{P_2}_t, ..., h^{P_n}_t\}$.
Scale Merging Layer. In layer $l-1$, suppose we have an
$H' \times W' \times F$ feature map at time $t$, with a merging window
size of $K \times K$. The scale merging layer takes this feature map
as input, concatenates features within each group of $K \times K$
neighboring grids, and applies a linear layer to reduce the
$K^2 \times F$-dimensional features back to $F$ channels. The scale
merging layer downsamples the resolution by a factor of $K \times K$,
resulting in an output feature map sized $\frac{H'}{K} \times \frac{W'}{K} \times F$.
The scale merging layer can be easily implemented using a
standard 2D convolutional layer with kernel size equal to $K$
and stride equal to $K$ (i.e., $\mathrm{Merge}(\cdot) = \mathrm{Conv}(\cdot)$).
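To make the merging step concrete, the following dependency-free Python sketch performs the concatenate-then-project operation on nested lists. It is only an illustration under stated assumptions: `scale_merge`, `weight`, and `bias` are hypothetical names, the map size is assumed to be divisible by the window, and a real implementation would more likely use a strided convolution from a deep learning library.

```python
def scale_merge(feature_map, k, weight, bias):
    """Merge each k x k group of grids into one grid with a shared linear layer.

    feature_map: H' x W' x F nested lists; weight: F x (k*k*F); bias: length F.
    Equivalent to a 2D convolution with kernel size k and stride k.
    """
    height, width = len(feature_map), len(feature_map[0])
    merged = []
    for r in range(0, height, k):
        row = []
        for c in range(0, width, k):
            # Concatenate the features of the k*k neighbouring grids.
            patch = [x for i in range(r, r + k) for j in range(c, c + k)
                     for x in feature_map[i][j]]
            # Linear projection from k*k*F back to F channels.
            row.append([sum(w * x for w, x in zip(w_row, patch)) + b
                        for w_row, b in zip(weight, bias)])
        merged.append(row)
    return merged

# 4 x 4 map with F = 1 channel; weights of 1 turn the layer into a plain block sum.
fmap = [[[float(i * 4 + j)] for j in range(4)] for i in range(4)]
out = scale_merge(fmap, k=2, weight=[[1.0, 1.0, 1.0, 1.0]], bias=[0.0])
# out == [[[10.0], [18.0]], [[42.0], [50.0]]]
```

The output has half the height and width of the input while keeping F channels, as described above.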
Spatial Modeling Block. A powerful spatial modeling
block is essential for the model to learn discriminative spatial
representations. Popular spatial modeling techniques such as

Fig. 6: The proposed hierarchical multi-scale spatio-temporal network (i.e., the multi-scale joint learning component).
ConvBlock [33] (i.e., standard convolution block), ResBlock
[26], and SEBlock [13] have been widely used in previous
ST prediction models. Swin-Transformer [34] has recently
achieved great success in ST modeling [35] and can also
be applied for spatial modeling. In this paper, we follow the
previous work [13], [36] by using squeeze-and-excitation (SE)
blocks (Fig. 7 illustrates the architecture of SEBlock).
Fig. 7: ResBlock and SEBlock.
SEBlock leverages spatial and channel-wise information
within local receptive fields at each layer without incurring
too many training parameters like attention-based methods do.
We believe the choice of spatial modeling block is a trade-off
between efficiency and performance. In practice, developers
can actually select their own preferred spatial modeling blocks,
and we also test some other alternatives, like ConvBlock and
ResBlock, in our experiments.
3) Cross-scale Modeling: Previous studies have revealed
that coarse-scale representations can benefit fine-grained
predictions [13], [27], making cross-scale modeling
necessary. In these studies, coarser scales are irregular and
typically use inverse mapping from small regions to large
regions to align the feature maps. Given our grid’s regular
hierarchical structure, with upper grids consistently being
integer multiples of lower grids, we can efficiently obtain
coarse-scale representations from finer scales using a bottom-
up pathway, which is similar to the feature pyramid network
[37]. Borrowing the idea from feature pyramids in computer
vision [37], we adopt a top-down lateral connection to enhance
the ST representations, which can be expressed as:
$$H^{P_i}_t = h^{P_i}_t + \mathrm{UpSample}(h^{P_{i+1}}_t), \quad 1 \le i \le n-1 \tag{9}$$
Fig. 8 gives an illustrative example of cross-scale representa-
tion enhancement. The function UpSample(·) aligns feature
maps from coarse scales with finer feature maps by increasing
spatial resolution through nearest neighbor upsampling. Sub-
sequently, the lateral connection merges feature maps with
the same resolution from both the bottom-up and top-down
pathways via element-wise addition. This iterative process
starts from the coarsest scale and concludes at the finest scale.
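The enhancement in Eq. 9 can be sketched as follows for single-channel maps. The function names are hypothetical, the sketch follows Eq. 9 as written (it upsamples the bottom-up map of the next coarser scale), and leaving the coarsest map unchanged is an assumption, since Eq. 9 only covers the first n−1 scales.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2D map (list of rows) by an integer factor."""
    return [[fmap[i // factor][j // factor]
             for j in range(len(fmap[0]) * factor)]
            for i in range(len(fmap) * factor)]

def cross_scale_enhance(features, windows):
    """Apply H_i = h_i + UpSample(h_{i+1}) for every scale except the coarsest.

    features: bottom-up maps ordered from finest to coarsest;
    windows[i]: merging window between features[i] and features[i + 1].
    """
    enhanced = list(features)                     # the coarsest map is kept unchanged
    for i in range(len(features) - 2, -1, -1):    # coarse-to-fine order
        up = upsample_nearest(features[i + 1], windows[i])
        enhanced[i] = [[a + b for a, b in zip(row, up_row)]
                       for row, up_row in zip(features[i], up)]
    return enhanced

fine = [[1, 1, 1, 1] for _ in range(4)]
coarse = [[10, 20], [30, 40]]
enhanced = cross_scale_enhance([fine, coarse], windows=[2])
# enhanced[0][0] == [11, 11, 21, 21]
```

Element-wise addition works here because the hierarchical grids are aligned, so every fine grid has exactly one parent grid at the next scale.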
Fig. 8: Illustration of the cross-scale modeling.
4) Multi-task Learning: After obtaining spatio-temporal
representations for each layer, we feed them into multiple
fully connected layers to obtain final multi-scale predictions.
It is worth noting that these fully connected layers are scale-
specific and will not share parameters across different scales.
$$\hat{X}^s_t = \mathrm{MLP}_s(H^s_t), \quad s \in P \tag{10}$$
Based on predictions ($\hat{X}^s_t$) and ground truths ($X^s_t$) at every
scale, we can train the multi-scale network by combining
the losses of different scales. However, a notable challenge
arises from significant differences in loss scale attributed
to diverse prediction targets across the scales. For instance,
crowd flow on the coarsest scale may be over 1,000 times
greater than that on the finest scale. These disparities may bias
the optimization of the multi-task loss toward coarse scales,
resulting in suboptimal performance on fine-grained scales.
To address this, prior studies try to balance learning tasks
by manually assigning different weights to each task (e.g.,
setting a smaller weight for the coarse scale [27]). However,
this manual weight assignment is hard to set optimally given
our relatively deep hierarchical structure, which typically consists
of 5 or 6 scales. Instead of balancing tasks at the loss level, we
propose an adaptive normalization mechanism that conducts
normalization transformation at the input level for each scale.
Specifically, the normalized input for each scale is:
$$\tilde{X}^s = \frac{X^s - E[X^s]}{\sqrt{\mathrm{Var}[X^s]}}, \quad s \in P \tag{11}$$
Then, both the inputs and their losses across all scales are
rescaled to a consistent magnitude. This approach ensures
equal consideration for each learning task across different
scales. As a result, the multi-task loss can be computed
straightforwardly as the sum of each scale’s loss without the
need to set hard-to-tune sum weights [27]:
$$\min_{\Theta} \sum_{s \in P} \sum_t L(\hat{X}^s_t, \tilde{X}^s_t) \tag{12}$$
where L, Θ are the loss function (e.g. mean square error) and
network parameters. P is the hierarchical structure consisting
of the scales used (e.g., P = {1, 2, 4, 8, 16}).
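A minimal Python sketch of the scale normalization in Eq. 11 and the unweighted loss in Eq. 12 is given below. The function names and toy values are hypothetical; it assumes flattened per-scale value lists, the population variance, a non-zero variance at every scale, and mean squared error as the loss.

```python
from statistics import fmean, pstdev

def normalize_scale(values):
    """Z-score one scale's flow values: (x - mean) / std, as in Eq. 11."""
    mean, std = fmean(values), pstdev(values)
    return [(x - mean) / std for x in values]

def multi_scale_loss(predictions, targets):
    """Unweighted sum over scales of the mean squared error (Eq. 12 with L = MSE)."""
    return sum(
        fmean((p - y) ** 2 for p, y in zip(predictions[s], targets[s]))
        for s in targets
    )

fine = [1.0, 2.0, 3.0, 4.0]                  # flows at a fine scale
coarse = [1000.0, 2000.0, 3000.0, 4000.0]    # same pattern, 1000 times larger
norm_fine, norm_coarse = normalize_scale(fine), normalize_scale(coarse)
targets = {1: norm_fine, 16: norm_coarse}
loss = multi_scale_loss(targets, targets)    # perfect predictions give zero loss
```

After normalization, the two toy scales take nearly identical values, so neither dominates the summed loss even though the raw magnitudes differ by a factor of 1,000.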
C. Optimal Combination Search and Index
Given grids with different scales, one intuitive way to compose
arbitrary modifiable areal units is the union operation, a
spatial operation in geographic information systems [38] that
combines geometries to create a new geometry representing their
joint spatial extent. It is also widely integrated into GIS tools
like ArcGIS [39]. The union operation follows the concept
of set-theoretic amalgamation and outputs a new geometry
that covers the combined area of the input geometries. We refer
to the system of modifiable areal units that can be composed
by a union of grids of different scales as the union system.
Eq. 3 describes the objective of the combination optimization
problem: for an arbitrary area, we need to find the hierarchical
grid combination with the lowest
prediction error. This problem is intractable since there are
many feasible grid combinations for R. Besides, getting the
optimal combination by brute force is time-consuming and
infeasible for an online system providing real-time spatio-temporal
prediction.
Fig. 9: Illustration of the hierarchical region decomposition.
To address this issue, we first prove that, given a union
system (i.e., only union operations can be used), the optimal
grid combination for an arbitrary areal unit can be achieved
by aggregating optimal combinations of decomposed fine-
grained hierarchical grids. This assertion holds in the absence
of further feasible combinations between hierarchical grids. In
pursuit of this, we introduce a method (Algorithm 1) dedicated
to decomposing modifiable areal units into hierarchical grids.
Its principal design is to decompose in a coarse-to-fine manner,
preventing the decomposed grids from being merged into
any coarser hierarchical grids. This ensures that there are
no additional feasible combinations between the decomposed
grids. Fig. 9 shows an example of hierarchical decomposition.
With this property, we only need to search for the optimal
combination for every preset-scale hierarchical grid, whose
search space is much smaller and can be done in an offline
manner. The following theorem ensures its effectiveness.
Theorem 4.1: Given a union system, any rasterized
region $R$ can be decomposed into a set of fine-grained
grids $\tilde{R} = \{r_1, r_2, ..., r_m\}$ (by Algorithm 1), and we have
$\Lambda^*(R) = \Lambda^*(r_1) + \Lambda^*(r_2) + ... + \Lambda^*(r_m)$.
Proof. Algorithm 1 decomposes $R$ into hierarchical grids $\tilde{R}$ that do not
intersect and cannot be merged into coarser ones, indicating
that there are no other feasible combinations between them. Hence, the
optimal combination of $R$ is the sum of the optimal combinations of the individual grids. ■
Algorithm 1: Hierarchical Decomposition
Input: An arbitrary rasterized region $R$, a set of hierarchical
grids $\{S_{P_1}, ..., S_{P_n}\}$ in $P$ ($|P| = n$)
Output: The decomposed hierarchical grids set $\tilde{R}$.
Initialize $\tilde{R} \leftarrow \emptyset$;
for $S \leftarrow S_{P_n}, S_{P_{n-1}}, ..., S_{P_1}$ do
    $B \leftarrow$ Match($R$, $S$);
    for each $b \in B$ do
        append connected grids $b$ to $\tilde{R}$;
        $R \leftarrow R - b$;
return $\tilde{R}$

Function Match($R$, $S$):
    Initialize $V \leftarrow \emptyset$;
    for each grid $s \in S$ do
        if $s \subseteq R$ then
            append grid $s$ to $V$;
    Create a graph $G$ using the grids in $V$ as nodes;
    for each pair of nodes $(u, v) \in G$ do
        if $u$, $v$ share the same upper grid and are adjacent then
            connect the edge between $u$ and $v$;
    $B \leftarrow$ ConnectedComponents($G$);
    return $B$
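The coarse-to-fine idea behind Algorithm 1 can be illustrated with the following simplified Python sketch. It is not a full implementation of the algorithm: it emits single grids only and omits the grouping of adjacent sibling grids into connected components performed by Match. The function name and the output format are hypothetical.

```python
def hierarchical_decompose(region, scales):
    """Greedy coarse-to-fine decomposition of a 0/1 assignment matrix.

    Returns (scale, row, col) triples, where (row, col) indexes the grid at that scale.
    """
    remaining = [row[:] for row in region]
    height, width = len(region), len(region[0])
    pieces = []
    for s in sorted(scales, reverse=True):            # coarsest scale first
        for r in range(height // s):
            for c in range(width // s):
                cells = [(i, j) for i in range(r * s, (r + 1) * s)
                         for j in range(c * s, (c + 1) * s)]
                if all(remaining[i][j] == 1 for i, j in cells):
                    pieces.append((s, r, c))
                    for i, j in cells:                 # R <- R - b
                        remaining[i][j] = 0
    return pieces

region = [[1, 1, 1, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]]
pieces = hierarchical_decompose(region, scales=[1, 2, 4])
# pieces == [(2, 0, 0), (1, 0, 2)]
```

Because coarser grids are matched first and removed from the region, none of the remaining finer grids can be merged back into a coarser grid.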
1) Optimal Combination Search with Dynamic Program-
ming for Union Operations: Theorem 4.1 reveals that to get
the optimal combinations for any given region R, we just need
to search for the optimal combination in each hierarchical grid.
For coarse hierarchical grids, potential candidate combinations
include either utilizing the grids directly or aggregating
finer grids with hierarchical relationships through union operations.
In general, assuming the merging window size $K$ is
a constant, for the hierarchical grids in layer $l$ ($l \ge 2$), there
are $\sum_{i=1}^{l-1} \frac{N}{K^{2i}}$ ($N = H \times W$) potential combinations from the
first layer (i.e., atomic raster) to layer $l-1$. Therefore, for an
$n$-layer hierarchical structure, the number of searches is:
$$\sum_{l=2}^{n} \sum_{j=1}^{l-1} \frac{N}{K^{2j}} = N\left(\frac{n-1}{K^2-1} - \frac{1-K^{2-2n}}{(K^2-1)^2}\right) \tag{13}$$
This results in a time complexity of $O(HWn)$ for the search.
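The closed form in Eq. 13 follows from summing the geometric series twice, and it can be checked numerically. The Python sketch below evaluates both sides with exact rational arithmetic; the function names and the example setting (a 32 × 32 atomic raster, K = 2, five layers) are chosen for illustration only.

```python
from fractions import Fraction

def search_count_direct(n_grids, k, n_layers):
    """Evaluate the double sum on the left-hand side of Eq. 13 term by term."""
    return sum(Fraction(n_grids, k ** (2 * j))
               for l in range(2, n_layers + 1)
               for j in range(1, l))

def search_count_closed(n_grids, k, n_layers):
    """Evaluate the closed form on the right-hand side of Eq. 13."""
    q = k * k - 1
    return n_grids * (Fraction(n_layers - 1, q)
                      - (1 - Fraction(1, k ** (2 * n_layers - 2))) / (q * q))

direct = search_count_direct(32 * 32, 2, 5)   # 1252
closed = search_count_closed(32 * 32, 2, 5)   # 1252
```

For this setting both expressions give 1252 searches; the count grows roughly linearly in n for fixed N, consistent with the O(HWn) bound.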

Lemma 4.2: In the hierarchical structure, we can find the
optimal combinations of grids in layer l by searching only the
grids in layer l−1, once we know their optimal combinations.
Proof. Let us consider a case in Fig. 12, which applies to
other layers or grids as well. Grid A (described by the quad-
tree index in Sec. IV-C3) is located in Layer 3. The optimal
combination of Grid A is Λ∗(A), which has the following
possible situations from Layer 1 to Layer 3.
Λ∗(A) ∈{A,
AA + AB + AC + AD,
AAA + AAB + AAC + AAD + AB + AC + AD, ...}
We have already found that Λ∗(AA) is the better choice between AA
and AAA + AAB + AAC + AAD. Likewise, Λ∗(AB),
Λ∗(AC), and Λ∗(AD) have been determined by searching the
combinations with the grids in the lower layer. Therefore,
we can determine that: Λ∗(A) ∈{A, Λ∗(AA) + Λ∗(AB) +
Λ∗(AC)+Λ∗(AD)}. This means that we can find the optimal
combinations of a grid in layer l by only searching for optimal
combinations in layer l −1. ■
Lemma 4.2 shows that the optimal combination of coarse
grid (i.e., Λ∗(A)) can be constructed from the optimal combi-
nation of its sub-problem (e.g., Λ∗(AA)). Hence, we employ
a bottom-up dynamic-programming schedule, facilitating the
search by traversing from the finest layer to the coarsest layer
in a single pass. As a result, in the union system, the time
complexity of searching optimal combinations for hierarchical
grids is reduced from O(HWn) to O(HW).
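The bottom-up search implied by Lemma 4.2 can be sketched as a post-order traversal in Python. The data layout (a dictionary per grid with validation predictions and ground truths), the squared-error criterion, and all names are assumptions made for this illustration rather than details from the paper.

```python
def squared_error(pred, truth):
    return sum((p - y) ** 2 for p, y in zip(pred, truth))

def best_combination(node):
    """Keep the grid itself or the union of its children's best choices.

    node: dict with 'name', 'pred' and 'truth' (lists over validation slots) and
    an optional 'children' list. Returns (chosen grid names, combined prediction).
    """
    direct = ([node["name"]], node["pred"])
    children = node.get("children")
    if not children:
        return direct
    names, union_pred = [], [0.0] * len(node["pred"])
    for child in children:
        child_names, child_pred = best_combination(child)   # solved sub-problem
        names += child_names
        union_pred = [u + p for u, p in zip(union_pred, child_pred)]
    union_error = squared_error(union_pred, node["truth"])
    direct_error = squared_error(node["pred"], node["truth"])
    return (names, union_pred) if union_error < direct_error else direct

tree = {"name": "A", "pred": [10.5, 12.5], "truth": [10.0, 12.0], "children": [
    {"name": "AA", "pred": [5.0, 6.0], "truth": [4.0, 5.0]},
    {"name": "AB", "pred": [3.0, 3.0], "truth": [3.0, 3.0]},
    {"name": "AC", "pred": [2.0, 2.0], "truth": [2.0, 2.0]},
    {"name": "AD", "pred": [1.0, 2.0], "truth": [1.0, 2.0]},
]}
names, pred = best_combination(tree)   # the direct prediction of A wins here
```

Each grid is visited once and only compares itself with the already solved choices of its direct children, which is what brings the search cost down to the number of grids.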
2) Improvement by Subtraction Operations: Theorem 4.1
guarantees the optimality in the union system. However, in
practice, the optimal combination for an arbitrary areal unit
may use operations other than union. Here, we further take
subtraction into account (i.e., by subtracting the complemen-
tary area from a coarser grid), and attempt to find other feasible
and better combinations to represent modifiable areal units.
Fig. 10: Left: Scale vs. predictability (with colored confidence
intervals). Right: A multi-grid and two single grids with
different predictability.
For instance, as shown in Fig. 10, if we take subtraction as a
potential operation, there are at least two ways to represent
multi-grid² L. The first is using three grids (i.e., A, B, and D)
with union operations; the other is using the coarse-scale grid
¯¯L to subtract the fine-scale grid C. In practice, the latter
may perform better than the former, as
coarse-scale grids may be easier to predict.
² A multi-grid consists of several adjacent grids on the same scale.
Specifically, as pointed out in previous research [24], the
Auto-Correlation Function (ACF) is a useful proxy for mea-
suring regions’ predictability. Our analysis confirms that areas
with high flows generally exhibit better predictability, as
indicated by larger ACF values (the left figure in Fig. 10). We
calculated the average ACF of each grid at different scales and
found that coarser scales are generally easier to predict, with
higher average ACF values.
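For reference, the sample autocorrelation at a given lag can be computed as in the Python sketch below. The function name and the toy series are hypothetical, and the lag used for the analysis is not specified here, so the lags in the example are arbitrary.

```python
def acf(series, lag):
    """Sample autocorrelation of a series at a positive integer lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var

periodic = [0.0, 1.0, 0.0, -1.0] * 6    # toy flow that repeats every 4 slots
acf_at_period = acf(periodic, 4)         # close to 1: the series repeats itself
acf_at_half_period = acf(periodic, 2)    # negative: the series is out of phase
```

A series with a strong, regular pattern yields a large ACF at its period, which is the sense in which a larger ACF indicates better predictability.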
Therefore, when querying a poor-predictability region where
the upper grid and complementary area have better predictabil-
ity, it may be advantageous to subtract the complementary
area from a coarser grid. For instance, in Fig. 10, Grid C
is the complementary area of multi-grid L under ¯¯L. Suppose
multi-grid L has few flows and poor predictability while grids
¯¯L (coarse-scale) and C (hotspot) are easy to predict. In this
case, subtracting Grid C from Grid ¯¯L can result in improved
estimates for multi-grid L.
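The ACF-based predictability argument above can be illustrated with a small sketch. It is not the authors' analysis code: the data are synthetic, the lag of 24 hours is an assumed choice, and `acf` and `synthetic_grid` are hypothetical helper names. Four fine grids share a daily cycle but carry independent noise; summing them into one coarse grid averages out part of the noise, so the coarse series tends to show a higher autocorrelation.

```python
import math
import random

def acf(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var

random.seed(0)
HOURS = 24 * 28  # four weeks of hourly flows (synthetic, assumed)

def synthetic_grid():
    # Shared daily cycle plus independent noise; purely illustrative data.
    return [10 + 5 * math.sin(2 * math.pi * h / 24) + random.gauss(0, 5)
            for h in range(HOURS)]

fine_grids = [synthetic_grid() for _ in range(4)]
coarse_grid = [sum(values) for values in zip(*fine_grids)]  # 2x2 merge by summation

fine_acf = sum(acf(g, 24) for g in fine_grids) / len(fine_grids)
coarse_acf = acf(coarse_grid, 24)
print(f"mean fine ACF(24) = {fine_acf:.2f}, coarse ACF(24) = {coarse_acf:.2f}")
```

On real flow data the gap depends on how strongly neighboring grids share temporal patterns, so this only shows the mechanism, not the magnitudes in Fig. 10.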
Fig. 11: Example of grid and multi-grid codes.
To streamline the search when the subtraction operation is
also considered, we introduce a grid coding rule
for representing multi-grids. As illustrated in Fig. 11, this
involves using a merging window size of 2 and allowing each
multi-grid to consist of up to three single grids. Grids A-D
represent four single grids whose optimal combinations were
determined in Sec. IV-C1, while Grids E-H and I-L are multi-
grids comprising two and three single grids respectively.
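The count of twelve codes per merging window can be checked with a short enumeration. This is a sketch under the assumption that "adjacent" means edge-adjacent; the cell coordinates and the names `CELLS`, `is_connected`, and `window_units` are hypothetical, and the mapping from these units to the letters A-L in Fig. 11 is not reproduced here.

```python
from itertools import combinations

# Positions of the four single grids inside one 2x2 merging window.
CELLS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def is_connected(cells):
    """True if the cells form one edge-connected group."""
    cells = set(cells)
    start = next(iter(cells))
    seen, stack = {start}, [start]
    while stack:
        r, c = stack.pop()
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nb in cells and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == cells

def window_units(max_size=3):
    """All single grids and multi-grids (up to max_size cells) in the window."""
    units = []
    for k in range(1, max_size + 1):
        for combo in combinations(CELLS, k):
            if is_connected(combo):
                units.append(frozenset(combo))
    return units

units = window_units()
counts = {k: sum(1 for u in units if len(u) == k) for k in (1, 2, 3)}
print(len(units), counts)
```

Diagonal pairs are excluded because they are not edge-adjacent, which leaves four single grids, four two-grid units, and four three-grid units.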
The search for multi-grids is conducted after getting optimal
combinations for single grids. As a result, the search outcomes
are at least as good as those of the single grid search
that considers only union operations (see Theorem 4.3). For
example, the optimal combination of multi-grid L in Fig. 10
is selected from the following two candidates:
Λ∗(L) ∈ {Λ∗(A) + Λ∗(B) + Λ∗(D), Λ∗(¯¯L) − Λ∗(C)}    (14)
where ¯¯L is the coarser grid containing multi-grid L in the
upper scale. During the search process, we will record the
best combinations for all the single grids and multi-grids in
Fig. 11, which can then be directly applied to determine the
optimal combinations for modifiable areal units.
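The selection in Eq. 14 can be sketched as picking, between the union candidate and the subtraction candidate, the one with the lower error on held-out data. The selection criterion (validation RMSE), the toy numbers, and the names `rmse` and `best_combination` are assumptions made for illustration, not the authors' implementation.

```python
import math

def rmse(truth, pred):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

def best_combination(truth, candidates):
    """Return (name, error) of the candidate with the lowest RMSE."""
    scored = {name: rmse(truth, pred) for name, pred in candidates.items()}
    name = min(scored, key=scored.get)
    return name, scored[name]

# Toy validation series for multi-grid L and the grids around it (hypothetical).
truth_L = [3.0, 4.0, 2.0, 5.0]
pred_A = [1.5, 1.0, 0.5, 2.5]
pred_B = [1.0, 2.0, 1.5, 1.0]
pred_D = [1.5, 2.0, 1.0, 0.5]
pred_Lbar = [13.0, 14.2, 11.9, 15.1]   # coarse grid containing L
pred_C = [10.0, 10.0, 10.0, 10.0]      # complementary hotspot grid

candidates = {
    "union": [a + b + d for a, b, d in zip(pred_A, pred_B, pred_D)],
    "subtraction": [l - c for l, c in zip(pred_Lbar, pred_C)],
}
choice, error = best_combination(truth_L, candidates)
print(choice, round(error, 3))
```

Because the union candidate is always part of the comparison, the selected error can never exceed the union-only error, which is the intuition behind Theorem 4.3.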
Theorem 4.3: The subtraction operations lead to solutions
that are either better or equivalent to those obtained through
the optimal combination search by union operations.
Proof. Let us consider the case in Fig. 10, which applies
to other multi-grids as well. Eq. 14 shows that we obtain
the optimal combination of L by either uniting three single
grids or subtracting the complementary area from a coarser
grid, whichever yields the lower error. The obtained solution
therefore either matches (i.e., Λ∗(A) + Λ∗(B) + Λ∗(D)) or
surpasses (i.e., Λ∗(¯¯L) − Λ∗(C)) the combination searched by
union operations alone. ■
3) Extended Quad-tree Index: Once optimal combinations
for each (multi-)grid are determined, they can be used for pre-
dicting region queries. Storing optimal combinations for single
and multiple grids incurs a space complexity of O(HW)
each. Notably, H and W are often larger than 100, and rapid
response times are crucial for downstream services in online
prediction scenarios. Retrieving these optimal combinations in
a linear table can be time-consuming and impractical.
Fig. 12: Example of an extended quad-tree with three layers.
Therefore, we propose using a quad-tree to index the
optimal combinations, as it is commonly used in spatial
databases [40], [41]. However, since there may be more than
four child grids for a coarse grid (e.g., a coarse grid may
have twelve child grids, including four single grids and eight
multi-grids as shown in Fig. 11), we extend the basic quad-
tree to allow nodes to have up to 12 child nodes. In Fig. 12,
we present an extended quad-tree example with three layers,
where each grid in the hierarchical structure has a unique
code. The introduction of the extended quad-tree reduces the
time complexity of retrieving the optimal combination from
O(HW) to O(log(HW)).
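A minimal sketch of such an index is given below. It assumes that each letter of a code such as "ADL" selects one child per layer and that the stored payload is an arbitrary description of the optimal combination; the class names `Node` and `ExtendedQuadTree` and the payload strings are hypothetical.

```python
CODES = "ABCDEFGHIJKL"  # four single grids plus eight multi-grids per parent

class Node:
    def __init__(self):
        self.children = {}        # code letter -> Node, at most 12 entries
        self.combination = None   # stored optimal combination (any payload)

class ExtendedQuadTree:
    def __init__(self):
        self.root = Node()

    def insert(self, path, combination):
        """Store a combination under a code path such as 'ADL'."""
        node = self.root
        for letter in path:
            if letter not in CODES:
                raise ValueError(f"unknown grid code: {letter}")
            node = node.children.setdefault(letter, Node())
        node.combination = combination

    def lookup(self, path):
        """Follow one child per layer; cost grows with tree depth only."""
        node = self.root
        for letter in path:
            node = node.children.get(letter)
            if node is None:
                return None
        return node.combination

tree = ExtendedQuadTree()
tree.insert("B", "direct")                 # coarsest-layer single grid
tree.insert("AB", "union of children")     # second-layer single grid
tree.insert("ADL", "parent minus C")       # third-layer multi-grid
print(tree.lookup("ADL"))
```

A lookup visits one node per layer, so its cost is bounded by the number of layers rather than by the number of stored grids.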
D. Modifiable Areal Units Prediction
1) Hierarchical Decomposition: To utilize the hierarchical
structure and benefit from the found optimal combinations for
grids, we first decompose arbitrary region queries into several
hierarchical grids by Algorithm 1. Fig. 9 gives an example of
hierarchical region decomposition, which involves three scales
of grids. The orange region query is decomposed into four
hierarchical grids consisting of a coarsest grid, two medium-
sized grids, and a multi-grid in the finest scale.
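The idea of hierarchical decomposition can be sketched with a greedy coarsest-first cover. This is a simplified stand-in, not Algorithm 1: it assumes the query has already been rasterized into a set of atomic cells, it only emits single grids (no multi-grid merging), and the function name `decompose` and the toy raster are hypothetical.

```python
def decompose(cells, size, scales=(4, 2, 1)):
    """Greedily cover `cells` (a set of (row, col) atomic cells) with
    aligned square blocks, trying the coarsest scale first."""
    remaining = set(cells)
    blocks = []  # entries are (scale, top_row, left_col)
    for s in scales:
        for r in range(0, size, s):
            for c in range(0, size, s):
                block = {(r + i, c + j) for i in range(s) for j in range(s)}
                if block <= remaining:
                    blocks.append((s, r, c))
                    remaining -= block
    return blocks, remaining

# Toy query on an 8x8 raster: one 4x4 block, one 2x2 block, one atomic cell.
query = {(r, c) for r in range(4) for c in range(4)}
query |= {(0, 4), (0, 5), (1, 4), (1, 5)}
query |= {(2, 4)}

blocks, leftover = decompose(query, size=8)
print(blocks)
```

Each emitted block would then be encoded and looked up in the index; adjacent blocks on the same scale could additionally be merged into multi-grids, which this sketch omits.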
2) Grid Indexing: After decomposing the region into hi-
erarchical grids, we encode them into indexes based on the
coding rule mentioned in Sec. IV-C2. Fig. 9 gives an example
of grid indexing. The orange region query is decomposed into
four hierarchical grids, indexed as B, AB, DB, and ADL.
These indexes allow us to quickly retrieve the optimal com-
bination for the decomposed grids from the pre-constructed
extended quad-tree. We use these optimal combinations to get
the predictions of hierarchical grids (e.g., through union or
subtraction operations) and sum them to obtain the predicted
results of the region query.
V. EXPERIMENTS
In this section, we evaluate the performance of One4All-ST
on two ST datasets to answer the following questions:
RQ1: Compared to the baseline methods (including fine-
grained methods and multi-scale methods), how does our
method perform in both effectiveness and efficiency?
RQ2: How does the optimal combination benefit the per-
formance?
RQ3: Compared with previous multi-scale ST networks
[13], [27], how do our proposed components (i.e., the hierar-
chical spatial modeling and scale normalization module) affect
the performance?
RQ4: How does the hierarchical structure (i.e., the merging
window size) impact the performance?
A. Experimental Setup
1) Datasets: We collect two ST datasets including taxi trips
and freight transport orders. We choose the last 20% duration
in each dataset as the test set, the 10% data before the test
for validation, and the remaining 70% for training. We predict
taxi/truck demand for the next hour.
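The chronological 70%/10%/20% split described above can be sketched as follows. The function name `chronological_split` and the example length (90 days of hourly steps, matching Jan. to Mar. 2013) are assumptions for illustration.

```python
def chronological_split(n_steps, val_pct=10, test_pct=20):
    """Split time-step indices into train / validation / test ranges,
    keeping temporal order (the test set is the most recent data)."""
    n_test = n_steps * test_pct // 100
    n_val = n_steps * val_pct // 100
    n_train = n_steps - n_val - n_test
    train = range(0, n_train)
    val = range(n_train, n_train + n_val)
    test = range(n_train + n_val, n_steps)
    return train, val, test

# 90 days of hourly observations (assumed length for illustration).
train, val, test = chronological_split(90 * 24)
print(len(train), len(val), len(test))
```

Splitting by time rather than at random avoids leaking future observations into training.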
Taxi NYC Dataset. We collect the taxi trip dataset from
NYC’s open data portal (see footnote 3). The whole dataset spans over 10
years, and we use the records from Jan. 2013 to Mar. 2013,
which amount to over 36,000,000 records. These records
include pick-up/drop-off times, locations, and trip distances.
Freight Transport Dataset. The freight transport dataset is
gathered from a world-leading online transportation company,
including the freight transport orders within a metropolis from
Oct. 2020 to Aug. 2021. A freight transport order typically
proceeds as follows: users place orders online, and truck owners
accept them online and provide transportation services. This
dataset contains over 7,000,000 records, with each order record
containing the start time and location (longitude and latitude).
We partition the area of interest in both datasets into
128×128 grids, each with a size of 150m×150m, which is
consistent with previous fine-grained prediction settings [13],
[42]. The hierarchical structure P = {1, 2, 4, 8, 16, 32} is
created by setting the maximum scale to 32 and using a
merging window size of 2.
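The hierarchical structure can be sketched by repeatedly merging 2×2 blocks of the atomic raster. The sketch assumes that a coarse grid's flow is the sum of its child grids' flows; `merge` and `build_pyramid` are hypothetical names, and the all-ones raster is toy data.

```python
def merge(raster, window=2):
    """Sum non-overlapping window x window blocks of a square raster."""
    n = len(raster) // window
    return [
        [
            sum(raster[r * window + i][c * window + j]
                for i in range(window) for j in range(window))
            for c in range(n)
        ]
        for r in range(n)
    ]

def build_pyramid(atomic, num_levels=6, window=2):
    """Return rasters for scales 1, 2, 4, ..., window**(num_levels - 1)."""
    levels = [atomic]
    for _ in range(num_levels - 1):
        levels.append(merge(levels[-1], window))
    return levels

atomic = [[1] * 128 for _ in range(128)]   # toy flows: one trip per atomic grid
pyramid = build_pyramid(atomic)
sizes = [len(level) for level in pyramid]
print(sizes)
```

With a 128×128 atomic raster, the six levels correspond to the scales in P = {1, 2, 4, 8, 16, 32}, and the total flow is preserved at every level.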
2) Evaluation Metrics: We exploit two widely used metrics
including RMSE (Root Mean Square Error) and MAPE (Mean
Absolute Percentage Error) to evaluate the performance of
predictions [9], [43], [44].
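For reference, the two metrics can be computed as in the sketch below. Skipping entries whose ground truth is zero in MAPE is an assumed convention (the handling of zero-flow entries is not specified here), and the sample values are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error over entries with non-zero truth
    (assumed convention to avoid division by zero)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)

y_true = [10, 20, 0, 40]
y_pred = [12, 18, 1, 40]
print(rmse(y_true, y_pred), mape(y_true, y_pred))
```

RMSE is dominated by high-flow regions, while MAPE weighs relative errors equally, which is why reporting both is informative across scales.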
3) Prediction Tasks: We choose four tasks with different
scales to evaluate the capability for predicting arbitrary MAU
in both datasets. Task 1 predicts flows on hexagons or census
tracts with an average spatial scale of 0.3 km2, enabling
detailed analysis tasks like fine-grained taxi flow prediction
[3], [17]. Task 2 predicts flows on tertiary road map segments
whose average spatial scale is 0.6 km2. Such spatial scale is
suitable for prediction tasks on function areas (e.g., residential
quarter) [45]. Task 3 predicts flows on secondary road map
segments with an average spatial scale of 1.3 km2, which
is applicable for supply-demand prediction tasks [15], [16].
3 https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Task 4 predicts flows on primary road map segments with an
average spatial scale of 4.8 km2. This scale aligns with typical
community analysis tasks [19], [46], [47]. Most region queries
in the above tasks are defined by spatial semantic boundaries
determined by visible features like streets and roads, enabling
detailed analysis of demographics and socioeconomic factors
[47]. Task 1 of the Freight Transport dataset is the only
exception, utilizing small hexagons (i.e., 350m×350m) as fine-
grained region queries. In transportation service applications
like ride-sharing, fixed-shaped region queries such as hexagons
are also prevalent [48]. We obtain the polygon boundaries of
the census tracts from the NYC open data site (see footnote 4). We generate road
map segments using an existing road segmentation method
[49]. The road network data are from OpenStreetMap (see footnote 5). In
Fig. 13, the upper and lower parts display the region queries
for the Taxi NYC and Freight Transport datasets, respectively.
Fig. 13: Visualization of region queries. Upper (Taxi NYC): (a) census tracts (Task 1) and (b)-(d) tertiary, secondary, and primary road map segments (Tasks 2, 3, 4). Lower (Freight Transport): (e) hexagons (Task 1) and (f)-(h) tertiary, secondary, and primary road map segments (Tasks 2, 3, 4).
4) Baselines and Enhanced Methods: We compare
One4All-ST with the following baselines. Most baselines,
except MC-STGCN [27], are single-scale ST models that
predict at the finest atomic scale. These single-scale methods
aggregate results on atomic grids to predict region queries.
In contrast, MC-STGCN, a bi-scale baseline, simultaneously
predicts at both the finest atomic and coarse-grained cluster
scale. The clustering process takes in geographic proximity
information and historical crowd flow as described in the orig-
inal paper [27]. MC-STGCN aims to utilize cluster predictions
whenever possible since coarse scales are generally easier to
predict as shown in the left chart in Fig. 10. Specifically, for
each region query, MC-STGCN uses cluster predictions if the
clusters fall within the region query area, with the remaining
complementary area predicted at the finest atomic scale.
4 https://www.nyc.gov/site/planning/data-maps/open-data.page#census
5 We download OSM data from http://download.geofabrik.de/
• HM (History Mean) predicts future demands using the
mean value of the historical records.
• XGBoost [50] is a tree model that takes historical traffic
data as features.
• ST-ResNet [26] applies residual convolution networks to
capture spatial correlations.
• GWN [10] (GraphWaveNet) proposes a data-driven method
for adaptively learning spatial correlations.
• ST-MGCN [15] applies graph convolutions to capture mul-
tiple spatial relations.
• GMAN [11] proposes spatial and temporal attention mech-
anisms to capture the spatial and temporal correlations.
• STRN [13] builds a coarse-grained scale that learns global
spatial dependencies to enhance fine-grained predictions.
• MC-STGCN [27] designs a cross-scale spatial-temporal
feature learning module and gives bi-scale ST prediction.
• STMeta [14] captures multiple kinds of temporal correla-
tions as well as heterogeneous spatial correlations.
Moreover, we implement two multi-scale models (i.e.,
M-ST-ResNet and M-STRN) by extending existing models. For each,
we train a separate ST model per scale (six models in total, see
Table II) to generate predictions on the same scales as
One4All-ST. To fully use multi-scale predictions
for modifiable areal units, we apply the proposed optimal
combination search (Sec. IV-C) to these methods.
5) Implementation Details: For fair comparisons, all methods
use the same temporal inputs as One4All-ST. These
inputs consist of 17 historical observations: six closeness
records, seven daily records, and four weekly records. The
only exception is HM, which uses one closeness record, three
daily records, and one weekly record, as selected through grid
search. Our experiment platform is a server with 8 CPU cores (Intel Core
i9-9900K @ 3.60GHz), 16 GB RAM, and one GPU (NVIDIA
GeForce RTX 2080). We use Python 3.6.5 with TensorFlow
1.13.1 on Ubuntu Linux (kernel 5.19.0-43-generic).
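The 17 temporal inputs can be sketched as index offsets relative to the target hour. The exact offsets are an assumption (closeness as the previous hours, daily as multiples of 24 hours, weekly as multiples of 168 hours); `temporal_indices` is a hypothetical helper, not a function from the released code.

```python
def temporal_indices(t, n_close=6, n_daily=7, n_weekly=4, steps_per_day=24):
    """Indices of the historical observations used to predict step t."""
    close = [t - k for k in range(1, n_close + 1)]
    daily = [t - k * steps_per_day for k in range(1, n_daily + 1)]
    weekly = [t - k * 7 * steps_per_day for k in range(1, n_weekly + 1)]
    return close, daily, weekly

close, daily, weekly = temporal_indices(1000)
min_history = 4 * 7 * 24   # earliest target step with a full weekly history
print(len(close) + len(daily) + len(weekly), min(weekly), min_history)
```

Under this assumed convention the seventh daily offset coincides with the first weekly offset (both are 168 hours back), so an implementation may choose different offsets to avoid the overlap.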
B. Experimental Result
1) Main Result: Here, we compare the effectiveness and
efficiency of One4All-ST with baselines (RQ1). Table I
presents RMSE and MAPE for various tasks across baselines (see footnote 6),
enhanced methods, and One4All-ST. Additionally, Table II
reports the computation cost of deep models. These two tables
further reveal several insightful observations.
First, recall that from Task 1 to Task 4, the scale of region
queries becomes coarser. In Table I, we observe that among
existing methods, deep learning methods (e.g., ST-ResNet and
STRN) perform only slightly better than statistical learning methods
(e.g., XGBoost) in Task 1, where most queries are on fine scales.
More importantly, the advantage of deep learning methods
becomes more apparent on coarser-scale tasks.
Second, when comparing ST-ResNet/STRN (single-scale
prediction models) with M-ST-ResNet/M-STRN (enhanced
multi-scale prediction models), considerable performance im-
provements were observed, especially for coarse-scale tasks.
It highlights the crucial role of coarse-scale predictions for
6 We also evaluated the methods using MAE and found results consistent with
RMSE and MAPE. Due to page limitations, we only report RMSE and MAPE.

TABLE I: Results on the Taxi NYC and Freight Transport datasets. The best two results are highlighted (best is in bold and
italic, second-best is in bold). M-ST-ResNet and M-STRN are our enhanced multi-scale models from ST-ResNet and STRN,
respectively, which adopt our proposed optimal combination search module to improve performance on modifiable areal queries.

Taxi NYC
              Task 1          Task 2          Task 3          Task 4
Method        RMSE    MAPE    RMSE    MAPE    RMSE    MAPE    RMSE    MAPE
Baselines
HM            21.95   0.130   29.52   0.122   60.50   0.124   138.9   0.130
XGBoost       19.09   0.116   25.40   0.111   53.60   0.115   137.3   0.110
ST-ResNet     19.14   0.117   24.80   0.108   49.85   0.109   126.6   0.100
GWN           18.80   0.125   24.55   0.105   49.72   0.104   117.5   0.098
ST-MGCN       19.05   0.118   25.47   0.109   50.81   0.110   126.2   0.098
GMAN          18.86   0.124   25.16   0.107   50.80   0.103   123.6   0.096
STRN          18.68   0.111   24.92   0.109   51.93   0.114   131.6   0.104
MC-STGCN      19.19   0.119   25.58   0.111   51.76   0.113   126.3   0.105
STMeta        19.04   0.109   25.99   0.114   53.26   0.122   134.4   0.103
Our Enhanced Methods
M-ST-ResNet   18.14   0.108   23.58   0.103   46.21   0.102   109.9   0.083
M-STRN        18.65   0.110   24.67   0.107   49.28   0.107   121.8   0.093
One4All-ST    17.48   0.104   22.74   0.099   44.45   0.099   110.2   0.082

Freight Transport
              Task 1          Task 2          Task 3          Task 4
Method        RMSE    MAPE    RMSE    MAPE    RMSE    MAPE    RMSE    MAPE
Baselines
HM            1.745   0.370   1.928   0.384   2.374   0.387   4.390   0.313
XGBoost       1.788   0.347   1.982   0.371   2.421   0.390   4.370   0.325
ST-ResNet     1.684   0.336   1.914   0.361   2.333   0.369   4.047   0.295
GWN           1.693   0.337   1.879   0.351   2.262   0.356   3.991   0.292
ST-MGCN       1.765   0.346   1.963   0.378   2.417   0.399   4.411   0.361
GMAN          1.721   0.360   1.891   0.362   2.304   0.375   4.100   0.304
STRN          1.653   0.333   1.917   0.363   2.343   0.380   4.112   0.312
MC-STGCN      1.758   0.370   1.945   0.384   2.397   0.396   4.412   0.330
STMeta        1.726   0.332   1.900   0.356   2.308   0.371   4.023   0.322
Our Enhanced Methods
M-ST-ResNet   1.683   0.336   1.856   0.344   2.241   0.350   3.769   0.275
M-STRN        1.652   0.332   1.842   0.341   2.226   0.340   3.846   0.271
One4All-ST    1.649   0.330   1.798   0.331   2.181   0.336   3.778   0.275
TABLE II: The computation cost comparison of deep models.
We report the total computational costs of six models in
M-ST-ResNet and M-STRN.

Taxi NYC      Training (sec/epoch)   Inference (sec)   # Parameters
ST-ResNet     21.35                  4.41              0.59M
GWN           11.98                  0.99              0.92M
ST-MGCN       20.52                  5.37              2.51M
GMAN          34.12                  0.90              0.22M
STRN          22.73                  2.33              0.88M
MC-STGCN      12.17                  2.68              1.68M
STMeta        20.42                  4.15              1.25M
M-ST-ResNet   47.00                  8.88              0.59M×6
M-STRN        55.00                  3.47              0.88M×6
One4All-ST    25.54                  3.65              0.72M
coarse-scale region queries. Additionally, it also suggests that
aggregating fine-scale predictions is insufficient for achieving
precise coarse-scale prediction results.
Third, although MC-STGCN (a bi-scale model) does not
perform as well as other deep methods (e.g., ST-ResNet
and STRN) on fine-scale tasks, thanks to the presence of
coarser scales, it slightly outperforms them in RMSE on the
coarsest Taxi NYC task (Task 4). It further suggests that
coarse-scale predictions are vital
for coarse-scale region queries. Note that MC-STGCN uses
separate spatial learning modules at different scales, resulting
in many more trainable parameters (i.e., 1.68M) and increased
cost compared to other methods.
Last, as shown in Table II, One4All-ST is relatively
lightweight in parameters (it even has fewer parameters than
STRN, a single-scale model), yet it outperforms every baseline
on all tasks in both RMSE and MAPE. Specifically, our
method achieves up to 10.6% improvement over the best
baseline (Task 3 RMSE on the Taxi NYC dataset). On
the other hand, compared with the enhanced methods (trained
separately without cross-scale information interaction), our
approach achieves better results on most tasks, which supports
the effectiveness of our hierarchical structure and cross-scale
modeling module. More importantly, our approach achieves
better or comparable performance relative to the enhanced methods
while using at most about 20% of their parameters (Table II), which
demonstrates the efficiency of our hierarchical ST network.
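The two headline figures in this paragraph can be re-derived from Tables I and II with a few lines of arithmetic. The variable names are illustrative; all numbers are copied from the tables.

```python
# Task 3 RMSE on Taxi NYC (Table I): best baseline is GWN.
best_baseline_rmse = 49.72
one4all_rmse = 44.45
improvement = (best_baseline_rmse - one4all_rmse) / best_baseline_rmse

# Parameter counts in millions (Table II); enhanced methods train six models.
one4all_params = 0.72
m_st_resnet_params = 0.59 * 6
m_strn_params = 0.88 * 6
ratio_resnet = one4all_params / m_st_resnet_params
ratio_strn = one4all_params / m_strn_params

print(f"RMSE improvement: {improvement:.1%}")
print(f"parameter ratio vs M-ST-ResNet: {ratio_resnet:.1%}, vs M-STRN: {ratio_strn:.1%}")
```

The parameter ratio is about one fifth relative to M-ST-ResNet and smaller still relative to M-STRN.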
2) Analysis of Optimal Combination Search: To explore
how the optimal combination search component improves
performance (RQ2), we analyze queries whose optimal combi-
nations are achieved via union or subtraction operations. There
are three strategies for predicting arbitrary modifiable areal
units: Direct, Union, and Union & Subtraction. Direct predicts
region queries by directly summing decomposed hierarchical
grid predictions (by Algorithm 1) without considering optimal
combination search. Union applies optimal grid combinations
obtained through union operations while Union & Subtraction
considers both union and subtraction operations.
Table III lists the RMSE for these three strategies over
four prediction tasks. In general, the optimal search with
union/subtraction yields more noticeable improvements
for coarse-scale tasks (e.g., Task 4).
The possible reason may be that fine-scale tasks already
exhibit optimal performance through the direct decomposition
in Algorithm 1 (i.e., utilizing the largest grids smaller than
the query area for prediction); hence, the optimal search with
union/subtraction cannot find better decomposition results. To
verify this, we conduct an analysis of the queries leading
to different grid decomposition for Union/Union & Subtrac-
tion in comparison to Direct. As anticipated, for fine-scale
tasks like Task 1, only a marginal 7.14%/8.14% of queries
show distinct decomposition results between Union/Union
& Subtraction and Direct. Nevertheless, focusing on these
queries with differing decomposition, we see a substantial
8.0%/9.2% improvement in Task 4 prediction accuracy. This
underscores the value of seeking a superior decomposition

Fig. 14: Effect of hierarchical structure.
Fig. 15: Response time to region queries.
Fig. 16: Effect of spatial modeling block.
Fig. 17: Analysis of index size.
TABLE III: Results of three region query decomposition
strategies on the Taxi NYC dataset. RMSE is all the queries’
average error; Prop. (%) is queries’ proportion of different grid
decompositions for Union/Union & Subtraction in comparison
to Direct; Imprv. (%) is the prediction accuracy improvement
of Union/Union & Subtraction compared to Direct specifically
on these differently decomposed queries.

          Direct   Union                      Union & Subtraction
          RMSE     Prop.    Imprv.   RMSE     Prop.    Imprv.   RMSE
Task 1    17.53    7.16%    1.2%     17.51    8.14%    2.0%     17.48
Task 2    23.02    10.1%    3.5%     22.75    12.9%    5.5%     22.74
Task 3    45.41    11.8%    5.8%     44.62    16.5%    7.1%     44.45
Task 4    113.8    11.6%    8.0%     110.6    12.1%    9.2%     110.2
through union/subtraction for performance enhancement.
Moreover, a comparison between Union & Subtraction and
Union shows that the proportion of differently decomposed
queries rises from 11.8% to 16.5% for Task 3, with a larger
improvement on these queries (7.1% vs. 5.8%). This indicates that the
introduction of subtraction operations can indeed yield better
region query decomposition results in practical applications.
Notably, the advantage of utilizing Union/Union & Subtraction
for optimal decomposition search also lies in its offline nature,
thereby incurring no overhead for online prediction services.
3) Ablation Study for the Hierarchical Multi-scale Network:
We conduct an ablation study to verify the effectiveness of our
proposed hierarchical spatial modeling and scale normaliza-
tion modules for multi-scale ST learning (RQ3). One4All-ST
(w/o HSM) removes the hierarchical spatial modeling module,
making every scale learn its ST representation from scratch
instead of learning from the previous scale’s representation.
One4All-ST (w/o SN) applies a single standard normalization
transformation to all scales.
Table IV displays ablation results on the Taxi NYC dataset.
Notably, One4All-ST (w/o HSM) underperforms One4All-ST at
every scale, with the advantage of One4All-ST becoming more
pronounced as the scale increases (11.8% RMSE reduction
in Task 4). This suggests that hierarchically learning
multi-scale spatial representations benefits coarse scales more.
Besides, One4All-ST (w/o SN) performs much worse than
One4All-ST, especially for fine scales (RMSE nearly doubling on
Tasks 1 and 2). This underscores the significance of employing a
scale normalization layer, as applying the same transformation
across all scales can disproportionately emphasize coarse
scales, neglecting fine scales. Achieving a balance across all
scales is crucial for modifiable areal units that may necessitate
consideration at various scales.
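The effect of sharing one transformation across scales can be seen in a toy example. This sketch assumes min-max scaling as the transformation and uses made-up flow values; it illustrates the motivation for per-scale normalization and is not the scale normalization module itself.

```python
def minmax_params(values):
    """Return the (min, max) pair used for min-max scaling."""
    return min(values), max(values)

def normalize(values, lo, hi):
    """Scale values into [0, 1] using the given bounds."""
    return [(v - lo) / (hi - lo) for v in values]

fine = [0, 2, 5, 8]                 # toy flows on scale-1 grids
coarse = [900, 1500, 2400, 3000]    # toy flows on scale-32 grids

# One transformation for all scales: fine-scale values collapse near zero.
g_lo, g_hi = minmax_params(fine + coarse)
fine_global = normalize(fine, g_lo, g_hi)

# One transformation per scale: fine-scale values use the full range.
fine_per_scale = normalize(fine, *minmax_params(fine))

print(max(fine_global), max(fine_per_scale))
```

With a shared transformation, a loss computed on normalized values is dominated by the coarse scales, which is consistent with the degradation of the w/o SN variant on fine-scale tasks.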
TABLE IV: Ablation results of the hierarchical multi-scale
network. Best is in bold and italic. ‘HSM’ and ‘SN’ are
the abbreviations of Hierarchical Spatial Modeling and Scale
Normalization.

Taxi NYC          One4All-ST   One4All-ST   One4All-ST
                  (w/o HSM)    (w/o SN)
Task 1   RMSE     18.36        34.59        17.48
         MAPE     0.108        0.228        0.104
Task 2   RMSE     24.41        41.16        22.74
         MAPE     0.107        0.184        0.099
Task 3   RMSE     49.14        69.46        44.45
         MAPE     0.113        0.157        0.099
Task 4   RMSE     125.0        135.1        110.2
         MAPE     0.091        0.150        0.082
4) Effect of Hierarchical Structure: As mentioned in
Sec. V-A, we construct the hierarchical structure using a 2×2
merging window, resulting in 0.72M parameters. To explore
how performance is affected by the merging window size
(RQ4), we experiment with window sizes of 3 × 3 and 4 × 4,
which produce hierarchical structures {1, 3, 9, 27} (0.54M
parameters) and {1, 4, 16} (0.46M parameters), respectively.
Fig. 14 shows the results of different hierarchical structures
based on One4All-ST. Among these structures, the 2×2 variant
performs the best, aligning with our expectations: it
predicts at denser scales, at the cost of more parameters than
the other variants. However, comparing the 3×3 and 4×4
variants, the 3 × 3 variant generally underperforms, despite
having more parameters and predicting additional scales. This
discrepancy could stem from the greater importance of Scale 4
over Scale 3 for the given census tract and road segmentation
queries. Another reason may be the extension of the atomic
raster by zero-padding (to make the raster divisible by 3),
which introduces background noise into spatial representation
learning for the 3 × 3 variant.
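How much padding the 3 × 3 variant needs can be estimated with a short calculation. It assumes the raster is padded up to the next multiple of the coarsest scale in the structure; the exact padded size used in the experiments is not stated, and `padded_size` is a hypothetical helper.

```python
def padded_size(size, window, num_levels):
    """Smallest side length >= size that is divisible by the coarsest scale."""
    coarsest = window ** (num_levels - 1)
    return -(-size // coarsest) * coarsest   # ceiling division, then scale back

side_2x2 = padded_size(128, window=2, num_levels=6)   # scales 1..32
side_3x3 = padded_size(128, window=3, num_levels=4)   # scales 1..27
side_4x4 = padded_size(128, window=4, num_levels=3)   # scales 1..16

pad_fraction_3x3 = 1 - (128 ** 2) / (side_3x3 ** 2)
print(side_2x2, side_3x3, side_4x4, round(pad_fraction_3x3, 3))
```

Under this assumption only the 3 × 3 structure requires padding, and roughly a tenth of its padded raster consists of zero-valued cells.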
The above analysis highlights the crucial role of hierarchical
structures in modifiable areal unit prediction. In this paper, we
opt for the 2×2 merging window, achieving outstanding pre-
diction performance with a modest computational cost (0.72M
parameters). Furthermore, recognizing that region queries may
prefer specific scales, we suggest a potential future direction:
finding optimal hierarchical structures under resource-limited
scenarios when region query scales are known in advance.
5) Analysis of Query Response Time: To test our system’s
capability to handle online services for predicting arbitrary
modifiable areal units, we display the response times of
different tasks on Taxi NYC and Freight Transport datasets
in Fig. 15. Recall that our system’s online phase (Sec. III) in-
volves two steps. Firstly, the deployed ST model continuously
predicts crowd flow for all hierarchical grids and synchronizes
with HBase at preset intervals (e.g., 1 hour). Then, region
query prediction is achieved by retrieving and aggregating
the optimal prediction of decomposed grids based on their
index, eliminating the need to rerun model inference for each query.
Therefore, we calculate the response time for region queries
by summing decomposition time and indexing time. As shown
in Fig. 15, the average response time increases with the task
scale. Notably, the average response time for all four tasks
on both datasets is below 2 milliseconds, with a maximum
response time not exceeding 20 milliseconds. This already
meets the requirements for online use.
6) Effect of Spatial Modeling Block: We compare the
proposed One4All-ST (SEBlock) with the variants using either
the residual block (ResBlock) [26] or the standard convolution
block (ConvBlock) [33] as the spatial modeling block. As
shown in Fig. 16, SEBlock consistently outperforms
ConvBlock and ResBlock by a significant margin in all cases,
demonstrating the effectiveness of incorporating channel-wise
information in feature maps. It reduces MAPE by up to 0.6%
compared to ResBlock, which is consistent with previous
research findings [13].
7) Analysis of Index Size: In Fig. 17, we calculate quad-
tree index sizes in both datasets at each scale. The quad-
tree indexes store optimal combinations for the hierarchical
structure P = {1, 2, 4, 8, 16, 32} with 150m×150m atomic
grids, supporting fine-grained ST prediction in metropolises
such as Shanghai and NYC. The total index sizes for Taxi NYC
and Freight Transport are relatively small at 66 MB and 64 MB
respectively, making them suitable for loading into a single
server for online region query processing.
VI. RELATED WORK
A. Deep Learning for Spatio-Temporal Prediction
Nowadays, deep neural networks are widely used for urban
spatio-temporal prediction. Existing ST prediction models are
categorized based on their input format into two main types: (i)
grid-based models, which divide the spatial domain evenly into
H × W fine-grained mesh-grids and typically use a 4D tensor
$\mathbb{R}^{T \times H \times W \times C}$ as input [26], [51]; and (ii) graph-based models,
which incorporate directed or undirected graphs to leverage
the topological structure for modeling and usually take a 3D
tensor $\mathbb{R}^{T \times N \times C}$ as input (where N represents the number of
nodes). Convolutional neural networks (CNNs) are commonly
used to capture local spatial dependencies in grid data, which
is inherently Euclidean [13], [26], [51]. In contrast, graphs
can represent non-Euclidean data as well as Euclidean
data. Therefore, graph neural networks (GNNs)
are beneficial for modeling irregular regions [46], [47] and
organizing a hierarchical structure with irregular partitions.
Additionally, GNNs can effectively capture long-range spatial
dependencies using either predefined graphs [11], [52]–[54] or
adaptive graphs [10], [12], [44]. In this paper, we use grids
to create a hierarchical structure because grids do not have
any prior preference for region partitioning (all grids are of
equal size), making them suitable for arbitrary modifiable areal
units. Unlike most previous studies that make predictions on a
specific scale, this paper proposes a framework that can give
ST predictions for arbitrary modifiable areal units, which is a
novel research problem and less studied in existing works.
B. Multi-scale Spatio-Temporal Learning
In recent years, we have witnessed several pioneering works
on multi-scale spatio-temporal learning. Most of these
focus on learning spatio-temporal representations on two
scales (i.e., node level and cluster level). There are two main
approaches to generating clusters. (i) Feature-based methods
cluster the nodes based on node features (e.g., historical
traffic observations) [27], [55]. For example, MC-STGCN [27]
performs both fine- and coarse-grained traffic flow predictions,
where the coarse-grained scale is clustered based on the topology
information of the road network and historical traffic flow
similarity. (ii) Learning-based methods construct clusters using
learned node representations [13], [32]. For example, STRN
[13] predicts fine-grained urban flows by fusing coarse-grained
cluster representations. The cluster is constructed based on
high-level node representations extracted by the backbone
network. Inspired by previous works, we build a hierarchical
structure with different scale grids and design a more efficient
network for multi-scale predictions (using fewer parameters
and getting better results). More importantly, we study the
optimal combination problem of how to leverage the multi-
scale outputs for arbitrary modifiable areal unit prediction.
VII. CONCLUSION AND FUTURE WORK
This paper proposes One4All-ST, which predicts spatio-
temporal (ST) data for any modifiable areal unit using only one
model, aiming at reducing the expensive cost and alleviating
the prediction inconsistency caused by many ST models. To
reduce the cost, we propose a hierarchical multi-scale ST net-
work with spatial modeling and scale normalization modules
to efficiently and equally learn multi-scale representations. To
alleviate prediction inconsistencies, we propose a dynamic
programming scheme to find the optimal combination with
minimized predicted error for representing modifiable areal
units. For in-time response in practical online scenarios, we in-
troduce an extended quad-tree to index optimal combinations.
Extensive experiments on real-world ST datasets demonstrate
that One4All-ST achieves the best prediction accuracy over
competitive baselines, at a much lower computation cost than
training a separate model for each scale.
Our future work includes: (1) developing approaches
to determine the optimal hierarchical structure for
further reducing computation costs in resource-limited scenarios;
and (2) utilizing GNNs to improve the modeling of long-range spatial
dependencies and exploring hierarchical structures
with irregular partitions that can be represented as graphs and
modeled via GNNs.

REFERENCES
[1] F. Calabrese, M. Colonna, P. Lovisolo, D. Parata, and C. Ratti, “Real-
time urban monitoring using cell phones: A case study in Rome,” IEEE
Transactions on Intelligent Transportation Systems, vol. 12, no. 1, pp.
141–151, 2011.
[2] H. Xu, A. Berres, S. B. Yoginath, H. Sorensen, P. J. Nugent, J. Severino,
S. A. Tennille, A. Moore, W. Jones, and J. Sanyal, “Smart mobility in
the cloud: Enabling real-time situational awareness and cyber-physical
control through a digital twin for traffic,” IEEE Transactions on Intelli-
gent Transportation Systems, vol. 24, no. 3, pp. 3145–3156, 2023.
[3] H. Yuan, G. Li, Z. Bao, and L. Feng, “An effective joint prediction model
for travel demands and traffic flows,” in 2021 IEEE 37th International
Conference on Data Engineering (ICDE), 2021, pp. 348–359.
[4] G. Li, X. Wang, G. S. Njoo, S. Zhong, S.-H. G. Chan, C.-C. Hung, and
W.-C. Peng, “A data-driven spatial-temporal graph neural network for
docked bike prediction,” in 2022 IEEE 38th International Conference
on Data Engineering (ICDE), 2022, pp. 713–726.
[5] S. Ling, Z. Yu, S. Cao, H. Zhang, and S. Hu, “Sthan: Transportation
demand forecasting with compound spatio-temporal relationships,” ACM
Trans. Knowl. Discov. Data, Oct. 2022.
[6] Y. Yao, B. Gu, Z. Su, and M. Guizani, “Mvstgn: A multi-view spatial-
temporal graph network for cellular traffic prediction,” IEEE Transac-
tions on Mobile Computing, vol. 22, no. 5, pp. 2837–2849, 2023.
[7] R.-G. Cirstea, B. Yang, C. Guo, T. Kieu, and S. Pan, “Towards spatio-
temporal aware traffic time series forecasting,” in 2022 IEEE 38th
International Conference on Data Engineering (ICDE), 2022, pp. 2900–
2913.
[8] Y. Cui, S. Li, W. Deng, Z. Zhang, J. Zhao, K. Zheng, and X. Zhou, “Roi-
demand traffic prediction: A pre-train, query and fine-tune framework,”
in 2023 IEEE 39th International Conference on Data Engineering
(ICDE), 2023, pp. 1340–1352.
[9] S. Guo, Y. Lin, L. Gong, C. Wang, Z. Zhou, Z. Shen, Y. Huang, and
H. Wan, “Self-supervised spatial-temporal bottleneck attentive network
for efficient long-term traffic forecasting,” in 2023 IEEE 39th Interna-
tional Conference on Data Engineering (ICDE), 2023, pp. 1585–1596.
[10] Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang, “Graph wavenet
for deep spatial-temporal graph modeling,” in Proceedings of the 28th
International Joint Conference on Artificial Intelligence, ser. IJCAI’19.
AAAI Press, 2019, pp. 1907–1913.
[11] C. Zheng, X. Fan, C. Wang, and J. Qi, “Gman: A graph multi-attention
network for traffic prediction,” Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 34, no. 01, pp. 1234–1241, 2020.
[12] Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang, “Con-
necting the dots: Multivariate time series forecasting with graph neural
networks,” in Proceedings of the 26th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, ser. KDD ’20,
2020, pp. 753–763.
[13] Y. Liang, K. Ouyang, J. Sun, Y. Wang, J. Zhang, Y. Zheng, D. Rosen-
blum, and R. Zimmermann, “Fine-grained urban flow prediction,” in
Proceedings of the Web Conference 2021, 2021, pp. 1833–1845.
[14] L. Wang, D. Chai, X. Liu, L. Chen, and K. Chen, “Exploring the
generalizability of spatio-temporal traffic prediction: Meta-modeling and
an analytic framework,” IEEE Transactions on Knowledge and Data
Engineering, vol. 35, no. 4, pp. 3870–3884, 2023.
[15] X. Geng, Y. Li, L. Wang, L. Zhang, Q. Yang, J. Ye, and Y. Liu,
“Spatiotemporal multi-graph convolution network for ride-hailing de-
mand forecasting,” Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 33, no. 01, pp. 3656–3663, 2019.
[16] Z. Pan, W. Zhang, Y. Liang, W. Zhang, Y. Yu, J. Zhang, and Y. Zheng,
“Spatio-temporal meta learning for urban traffic prediction,” IEEE
Transactions on Knowledge and Data Engineering, vol. 34, no. 3, pp.
1462–1476, 2022.
[17] C. Zheng, X. Fan, C. Wen, L. Chen, C. Wang, and J. Li, “DeepSTD:
Mining spatio-temporal disturbances of multiple context factors for citywide
traffic flow prediction,” IEEE Transactions on Intelligent Transportation
Systems, vol. 21, no. 9, pp. 3744–3755, 2020.
[18] J. Yuan, Y. Zheng, and X. Xie, “Discovering regions of different
functions in a city using human mobility and POIs,” in Proceedings of the
18th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, 2012, pp. 186–194.
[19] L. Sun, X. Ling, K. He, and Q. Tan, “Community structure in traffic
zones based on travel demand,” Physica A: Statistical Mechanics and
its Applications, vol. 457, pp. 356–363, 2016. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0378437116300346
[20] D. W. Wong, “The modifiable areal unit problem (MAUP),” in
WorldMinds: Geographical Perspectives on 100 Problems. Springer, 2004,
pp. 571–575.
[21] S. Openshaw, “The modifiable areal unit problem,” Quantitative geog-
raphy: A British view, pp. 60–69, 1981.
[22] S. C. de Andrade, C. Restrepo-Estrada, L. H. Nunes, C. A. M. Ro-
driguez, J. C. Estrella, A. C. B. Delbem, and J. Porto de Albuquerque,
“A multicriteria optimization framework for the definition of the spatial
granularity of urban social media analytics,” International Journal of
Geographical Information Science, vol. 35, no. 1, pp. 43–62, 2021.
[23] J. Jin, P. Cheng, L. Chen, X. Lin, and W. Zhang, “GridTuner:
Reinvestigate grid size selection for spatiotemporal prediction models,” in 2022
IEEE 38th International Conference on Data Engineering (ICDE), 2022,
pp. 1193–1205.
[24] L. Chen, J. Fang, Z. Yu, Y. Tong, S. Cao, and L. Wang, “A data-driven
region generation framework for spatiotemporal transportation service
management,” in Proceedings of the 29th ACM SIGKDD Conference
on Knowledge Discovery and Data Mining, ser. KDD ’23, 2023,
pp. 3842–3854.
[25] J. Li, S. Wang, J. Zhang, H. Miao, J. Zhang, and P. S. Yu, “Fine-
grained urban flow inference with incomplete data,” IEEE Transactions
on Knowledge and Data Engineering, vol. 35, no. 6, pp. 5851–5864,
2023.
[26] J. Zhang, Y. Zheng, and D. Qi, “Deep spatio-temporal residual networks
for citywide crowd flows prediction,” in Thirty-first AAAI conference on
artificial intelligence, 2017.
[27] S. Wang, M. Zhang, H. Miao, Z. Peng, and P. S. Yu, “Multivari-
ate correlation-aware spatio-temporal graph convolutional networks for
multi-scale traffic prediction,” ACM Trans. Intell. Syst. Technol., vol. 13,
no. 3, Jan. 2022.
[28] F. Wang, J. Xu, C. Liu, R. Zhou, and P. Zhao, “MTGCN: A multitask
deep learning model for traffic flow prediction,” in Database Systems for
Advanced Applications: 25th International Conference, DASFAA 2020,
Jeju, South Korea, September 24–27, 2020, Proceedings, Part I, 2020,
pp. 435–451.
[29] S. Wang, J. Cao, and P. S. Yu, “Deep learning for spatio-temporal
data mining: A survey,” IEEE Transactions on Knowledge and Data
Engineering, vol. 34, no. 8, pp. 3681–3700, 2022.
[30] “Apache Hive,” https://hive.apache.org/, accessed: February 23, 2024.
[31] “Apache HBase,” https://hbase.apache.org/, accessed: February 23, 2024.
[32] Y. Ma, P. Gerard, Y. Tian, Z. Guo, and N. V. Chawla, “Hierarchical
spatio-temporal graph neural networks for pandemic forecasting,” in
Proceedings of the 31st ACM International Conference on Information
& Knowledge Management, ser. CIKM ’22, New York, NY, USA, 2022,
pp. 1481–1490.
[33] J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi, “DNN-based prediction
model for spatio-temporal data,” in Proceedings of the 24th ACM
SIGSPATIAL International Conference on Advances in Geographic
Information Systems, ser. SIGSPATIAL ’16, 2016.
[34] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and
B. Guo, “Swin transformer: Hierarchical vision transformer using shifted
windows,” in 2021 IEEE/CVF International Conference on Computer
Vision (ICCV), Oct. 2021, pp. 9992–10002.
[35] K. Bi, L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, “Accurate
medium-range global weather forecasting with 3D neural networks,”
Nature, vol. 619, no. 7970, pp. 533–538, 2023.
[36] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2018, pp. 7132–7141.
[37] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
“Feature pyramid networks for object detection,” in 2017 IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), 2017, pp.
936–944.
[38] S. Arlinghaus and J. Kerski, “Spatial transformations and visualization:
Selected common threads and root concepts linking old to new,” Solstice:
An Electronic Journal of Geography and Mathematics, vol. XXV,
Jun. 2015.
[39] “Union (analysis),” https://desktop.arcgis.com/en/arcmap/latest/tools/analysis-toolbox/union.htm,
accessed: February 23, 2024.
[40] R. K. V. Kothuri, S. Ravada, and D. Abugov, “Quadtree and R-tree
indexes in Oracle Spatial: A comparison using GIS data,” in Proceedings
of the 2002 ACM SIGMOD International Conference on Management
of Data, ser. SIGMOD ’02, 2002, pp. 546–557.
[41] C. Zhang, Y. Zhang, W. Zhang, and X. Lin, “Inverted linear quadtree:
Efficient top k spatial keyword search,” in 2013 IEEE 29th International
Conference on Data Engineering (ICDE), 2013, pp. 901–912.
[42] H. Yu, X. Xu, T. Zhong, and F. Zhou, “Overcoming forgetting in fine-
grained urban flow inference via adaptive knowledge replay,” in Proceed-
ings of the Thirty-Seventh AAAI Conference on Artificial Intelligence
and Thirty-Fifth Conference on Innovative Applications of Artificial
Intelligence and Thirteenth Symposium on Educational Advances in
Artificial Intelligence, ser. AAAI’23/IAAI’23/EAAI’23, 2023.
[43] W. Qian, D. Zhang, Y. Zhao, K. Zheng, and J. Q. Yu, “Uncertainty
quantification for traffic forecasting: A unified approach,” in 2023 IEEE
39th International Conference on Data Engineering (ICDE), Apr. 2023,
pp. 992–1004.
[44] Y. Zhao, X. Luo, W. Ju, C. Chen, X.-S. Hua, and M. Zhang, “Dynamic
hypergraph structure learning for traffic flow forecasting,” in 2023 IEEE
39th International Conference on Data Engineering (ICDE), 2023, pp.
2303–2316.
[45] K. Wang, L. Liu, Y. Liu, G. Li, F. Zhou, and L. Lin, “Urban regional
function guided traffic flow prediction,” Information Sciences, vol. 634,
pp. 308–320, 2023. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0020025523004334
[46] J. Sun, J. Zhang, Q. Li, X. Yi, Y. Liang, and Y. Zheng, “Predict-
ing citywide crowd flows in irregular regions using multi-view graph
convolutional networks,” IEEE Transactions on Knowledge and Data
Engineering, vol. 34, no. 5, pp. 2348–2359, 2022.
[47] X. Wang, Z. Zhou, Y. Zhao, X. Zhang, K. Xing, F. Xiao, Z. Yang,
and Y. Liu, “Improving urban crowd flow prediction on flexible region
partition,” IEEE Transactions on Mobile Computing, vol. 19, no. 12, pp.
2804–2817, 2020.
[48] X. Tang, Z. T. Qin, F. Zhang, Z. Wang, Z. Xu, Y. Ma, H. Zhu, and
J. Ye, “A deep value-network based approach for multi-driver order
dispatching,” in Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, KDD 2019,
Anchorage, AK, USA, August 4-8, 2019, 2019.
[49] N. J. Yuan, Y. Zheng, and X. Xie, “Segmentation of urban areas using
road networks,” Tech. Rep. MSR-TR-2012-65, July 2012.
[50] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,”
in Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, ser. KDD ’16, 2016,
pp. 785–794.
[51] J. Zhang, Y. Zheng, J. Sun, and D. Qi, “Flow prediction in spatio-
temporal networks based on multitask deep learning,” IEEE Transactions
on Knowledge and Data Engineering, vol. 32, no. 3, pp. 468–478, 2020.
[52] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent
neural network: Data-driven traffic forecasting,” in International Con-
ference on Learning Representations (ICLR ’18), 2018.
[53] D. Chai, L. Wang, and Q. Yang, “Bike flow prediction with multi-graph
convolutional networks,” in Proceedings of the 26th ACM SIGSPATIAL
International Conference on Advances in Geographic Information
Systems, ser. SIGSPATIAL ’18, 2018, pp. 397–400.
[54] C. Song, Y. Lin, S. Guo, and H. Wan, “Spatial-temporal synchronous
graph convolutional networks: A new framework for spatial-temporal
network data forecasting,” Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 34, no. 01, pp. 914–921, Apr. 2020.
[55] Q. Wang, B. Guo, Y. Ouyang, K. Shu, Z. Yu, and H. Liu, “Spatial
community-informed evolving graphs for demand prediction,” in Ma-
chine Learning and Knowledge Discovery in Databases. Applied Data
Science and Demo Track: European Conference, ECML PKDD, 2020,
pp. 440–456.
